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Abstract 

This  paper  describes  a  unified,  element  based  Galerkin  (EBG)  framework  for 
a  three-dimensional,  nonhydrostatic  model  for  the  atmosphere.  In  general, 
EBG  methods  possess  high-order  accuracy,  geometrical  flexibility,  excellent 
dispersion  properties  and  good  scalability.  Our  nonhydrostatic  model,  based 
on  the  compressible  Euler  equations,  is  appropriate  for  both  limited-area 
and  global  atmospheric  simulations.  Both  a  Continuous  Galerkin  (CG),  or 
spectral  element,  and  discontinuous  Galerkin  (DG)  model  are  considered  us¬ 
ing  hexahedral  elements.  The  formulation  is  suitable  for  both  global  and 
limited-area  atmospheric  modeling,  although  we  restrict  our  attention  to  3D 
limited-area  phenomena  in  this  study;  global  atmospheric  simulations  will  be 
presented  in  a  follow-up  paper.  Domain  decomposition  and  communication 
algorithms  utilized  by  both  our  CG  and  DG  models  are  presented.  The  com¬ 
munication  volume  and  exchange  algorithms  for  CG  and  DG  are  compared 
and  contrasted.  Numerical  verification  of  the  model  is  performed  using  two 
test  cases:  flow  past  a  3D  mountain  and  buoyant  convection  of  a  bubble  in 
a  neutral  atmosphere;  these  tests  indicate  that  both  CG  and  DG  can  simu¬ 
late  the  necessary  physics  of  dry  atmospheric  dynamics.  Scalability  of  both 
methods  is  shown  up  to  8192  CPU  cores,  with  near  ideal  scaling  for  DG  up 
to  32768  cores. 
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1.  Introduction 


As  the  resolution  of  numerical  weather  prediction  (NWP)  models  increase, 
nonhydrostatic  effects  become  relevant.  Almost  all  current  limited-area  mod¬ 
els  utilize  a  nonhydrostatic  core,  while  global  models  are  currently  transition¬ 
ing  from  the  hydrostatic  to  the  nonhydrostatic  regime.  A  host  of  challeng¬ 
ing,  non-trivial  numerical  problems  arise  when  one  enters  the  nonhydrostatic 
regime:  these  include  1)  choosing  the  appropriate  equation  set  to  ensure  ef¬ 
ficiency,  accuracy,  and  conservation,  2)  effectively  resolving  multi-scale  flow 
features,  that  may  require  adaptive  mesh  refinement  (AMR),  3)  developing 
efficient  time-integrators  and/or  “soundproof’  [23,  30]  equation  sets  to  con¬ 
front  the  fast  acoustic  and  gravity  waves,  and  4)  developing  scalable  parallel 
codes  for  shared,  distributed,  and  hybrid  architectures  based  on  items  1)  -3). 

Virtually  all  current  nonhydrostatic  NWP  models  are  based  on  a  com¬ 
bination  of  finite- difference  discretization  in  space  and  either  split-explicit 
or  semi- implicit  (i.e.,  implicit-explicit  or  IMEX)  discretization  in  time.  Ex¬ 
amples  include  WRF  [24]  (NCAR),  Lokal  Modell  [32]  (DWD),  COAMPS 
[19]  (US  Navy),  and  UM  [3]  (UK  Met  Office).  Although  finite  difference 
schemes  are  very  efficient,  they  may  suffer  from  several  problems,  including: 
dispersion  error  (due  to  low-order  approximations),  geometrical  inflexibility, 
and  lack  of  scalability  to  large  (e.g.  tens  of  thousands)  of  processors.  To 
overcome  some  of  the  limitations  of  finite-difference  approximations,  several 
emerging  NWP  models  utilize  finite-volume  spatial  discretization,  such  as 
MPAS  [39]  and  MCORE  [40],  which  utilize  polynomial  reconstruction  of  the 
inter-element  fluxes.  In  order  to  achieve  higher-order  accuracy,  these  finite- 
volume  based  methods  utilize  a  larger  halo,  which  may  impede  scalability  at 
high  processor  counts.  Other  approaches  within  a  finite  volume  framework 
include  the  Flux-Based  Characteristic  Semi-Lagrangian  (FBCSL)  method 
[29],  which  uses  the  method  of  characteristics  to  enable  larger  time-steps 
that  may  present  an  obstacle  to  scalability. 

Alternatively,  element-based  Galerkin  (EBG)  methods  have  recently  been 
proposed  for  several  next  generation  non-hydrostatic  NWP  models  [14,  15, 
31],  as  well  as  for  hydrostatic  models  using  both  continuous  Galerkin  (CG) 
[7,  6]  and  discontinuous  Galerkin  (DG)  [28]  formulations.  We  note  that  the 
hydrostatic  models  are  only  valid  for  horizontal  resolutions  coarser  than  10 
km  and  for  this  reason  are  not  viable  options  for  mesoscale  (regional)  mod¬ 
eling.  EBG  methods  possess  several  desirable  attributes  for  future  NWP 
nonhydrostatic  models,  such  as  high-order  accuracy,  geometrical  flexibility, 
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whereby  the  solver  is  completely  independent  of  the  grid,  excellent  dispersion 
properties  [26]  and  minimal  communication  overhead  within  a  parallel  im¬ 
plementation.  High-order  accuracy,  which  allows  fine-scale  atmospheric  flow 
features  to  be  resolved,  is  achieved  by  representing  the  prognostic  variables 
via  a  polynomial  basis  function  expansion  within  each  element.  Geometri¬ 
cal  flexibility,  which  is  inherent  to  all  EBGs  (both  low  and  high  order),  is 
advantageous  since  any  terrain-following  coordinate  may  be  used  within  the 
same  solver;  also,  both  static  and  dynamic  adaptivity  may  be  retro-fitted  to 
the  existing  solver.  Finally,  in  a  distributed  memory  environment  (e.g.  clus¬ 
ter),  low  communication  overhead  is  critical  to  maintain  linear  scalability  to 
hundreds  of  thousands  of  processor  cores.  In  our  previous  work,  high-order 
accuracy  and  geometrical  flexibility  were  addressed  within  explicit  [14]  and 
IMEX  [15]  frameworks  for  2D  (x-z  slices)  problems  using  a  serial  implemen¬ 
tation.  However,  parallel  implementation  was  not  explicitly  addressed.  The 
purpose  of  the  present  work  is  to  extend  the  work  begun  in  [14],  [31],  and 
[15]  to  realistic  three-dimensional  (3D)  domains  using  a  parallel,  MPI-based 
implementation. 

The  present  work  is  guided  by  a  need  for  highly  scalable  models  in 
distributed  memory  environments.  In  the  past  five  years,  clock  speeds  of 
processors  have  remained  stagnant;  to  achieve  increased  floating-point  per¬ 
formance,  chip  manufacturers  have  developed  multiple  core  processors  that 
allow  many  threads  to  execute  in  parallel.  In  tandem,  high  performance 
computing  (HPC)  has  evolved  towards  clusters  with  processors  counts  ex¬ 
ceeding  100,000.  As  we  approach  the  exaflop  era,  core  counts  are  expected 
to  approach  1,000,000.  Therefore,  next-generation  NWP  models  must  be 
based  upon  scalable  numerical  methods  that  allow  arbitrarily  large  proces¬ 
sor  counts  with  minimal  communication  overhead.  EBG  methods,  both  CG 
and  DG,  have  proved  effective  in  this  respect  in  modeling  massive  biological 
flow  [17],  modeling  the  hydrostatic  atmosphere  [11,  16],  incompressible  flows 
using  low-order  finite  elements  [20],  and  geodynamical  problems  [41], 

This  paper  presents  a  scalable,  3D  nonhydrostatic  atmospheric  model 
based  on  a  unified  element  based  Galerkin  (EBG)  method  targeted  toward 
distributed  memory  architectures;  to  our  knowledge,  these  are  the  first  3D 
continuous  Galerkin  (spectral  element)  and  discontinuous  Galerkin  models 
for  a  nonhydrostatic  atmosphere.  We  are  developing  a  unified  dynamical  core 
appropriate  for  both  mesoscale  and  global  simulations;  this  nonhydrostatic 
unified  model  of  the  atmosphere  we  call  NUMA.  The  current  implementa¬ 
tion  utilizes  tensor  products  of  Lagrange  polynomials  in  a  hexahedral  grid  for 
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maximum  computational  efficiency;  however,  more  flexible  grids  based  on  ei¬ 
ther  tetrahedra  or  triangular  prisms  may  be  incorporated  into  future  versions. 
The  remainder  of  this  paper  is  structured  as  follows.  In  Section  2,  we  formu¬ 
late  the  nonhydrostatic  compressible  Euler  equations  (Set  2NC  and  2C  from 
[15]),  which  constitute  the  governing  equations  of  our  dynamical  core.  To 
ensure  numerical  stability,  these  nonhydrostatic  equations  are  solved  about  a 
hydrostatic  base  state.  In  Section  3,  we  present  both  the  continuous  and  dis¬ 
continuous  Galerkin  discretization,  along  with  the  explicit  time-integrator, 
boundary  conditions,  and  artificial  diffusion.  Section  4,  which  forms  the  core 
of  the  paper,  outlines  the  parallelization  algorithm  for  both  the  CG  and  DG 
methods.  The  communication  volume  and  memory  access  patterns  of  both 
CG  and  DG  are  compared  in  this  section.  Numerical  results  for  a  3D  linear 
hydrostatic  mountain  and  a  3D  rising  thermal  bubble  are  shown  in  Section 
5  along  with  the  results  of  the  scalability  experiments.  Scalability  is  demon¬ 
strated  to  8192  processor  cores  for  both  the  CG  and  DG  methods,  with  near 
ideal  scaling  for  DG  up  to  32768  cores. 

2.  Governing  Equations 

We  consider  the  fully  compressible,  nonhydrostatic  Euler  equations  in 
both  conservative  and  non-conservative  form.  These  equations  (Sets  2NC  and 
2C),  which  are  valid  for  spatial  resolutions  finer  than  10  km,  have  previously 
been  considered  in  [14]  for  2D  limited-area  atmospheric  flows.  Set  2NC  is 
considered  within  a  continuous  Galerkin  (CG)  framework,  whereas  Set  2C  is 
considered  within  a  discontinuous  Galerkin  (DG)  framework. 

2.1.  Non- Conservative  Form  (Set  2NC) 

Of  the  five  equation  sets  considered  in  [15],  set  2NC  proved  to  be  both 
computationally  efficient  and  provides  acceptable  mass  and  energy  conserva¬ 
tion  properties;  in  addition,  replacing  the  advection  operator  in  the  momen¬ 
tum  equation  allows  for  formal  conservation  of  energy  up  to  time  truncation 
error.  Both  diabatic  forcing  and  the  effects  of  moisture  are  neglected;  in 
other  words,  we  consider  a  dry  dynamical  core  without  sub- grid  scale  tur¬ 
bulence  closure.  In  the  present  study,  we  consider  three-dimensional  flow 
in  Cartesian  coordinates  ( x-y-z )  subject  to  gravitational  and  Coriolis  forces, 
yielding 

+  V  ■  (pu)  =  0  (la) 
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with  the  prognostic  variables  defined  as  (p,  ur,0),  where  p  is  density,  u  = 
(v,  v ,  w)T  is  velocity,  and  0  is  potential  temperature;  the  superscript  T  de¬ 
notes  the  transpose  operator.  In  addition,  P  is  pressure,  g  is  the  gravitational 
constant,  f  =  2Qk  is  the  Coriolis  parameter  (with  Q  the  angular  frequency 
of  the  earth),  and  k  is  the  unit  vector  in  the  z  direction.  Eq.  (la)  enforces 
mass  conservation,  Eq.  (lb)  enforces  conservation  of  momentum,  and  Eq. 
(lc)  enforces  conservation  of  entropy.  To  close  the  system  of  conservation 
laws  given  by  Eq.  (1),  a  thermodynamic  equation  of  state  is  required.  We 
utilize  the  ideal  gas  law  given  by 


P  =  Pa 


pR9  V 


where  Pa  is  the  atmospheric  pressure  at  ground,  R  =  cp 


(2) 

:v  is  the  ideal  gas 

constant,  and  7  =  ^  ~  1.4  is  the  ratio  of  specific  heats.  In  three  dimensions, 
Eqs.  (1)  and  (2)  constitute  a  closed  system  of  nonlinear  partial  differential 
equations  in  five  unknowns. 

To  facilitate  the  solution  of  the  compressible  Euler  equations  and  maintain 
numerical  stability,  we  split  the  density,  pressure,  and  potential  temperatures 
about  their  mean  hydrostatic  values: 


p(x,  z,  y,  t)  =  Po(z)  +  p'{x,  y,  z,  t ) 

(3a) 

9(x,  y,  z,  t )  =  60(z)  +  &{x,  y,  z,  t ) 

(3b) 

P(x,  y,  z,  t )  =  P0(z)  +  P\x,  y,  z,  t ) 

(3c) 

where  po,  6*o,  and  Pq  are  the  hydrostatic  reference  states.  For  all  the  test 
cases  considered  in  this  paper,  the  reference  states  are  functions  of  the  vertical 
coordinate  z\  however,  more  sophisticated  test  cases  require  reference  states 
that  are  functions  of  x ,  y,  and  z.  Inserting  Eq.  (3)  into  Eq.  (1)  and  applying 
hydrostatic  balance 

dPo  m 

=  -pog  (4) 


dz 


yields  the  system 
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Defining  a  solution  vector  q  =  (//,  ur,  0')T ,  Eq.  (5)  is  written  in  condensed 
form  as 

f  =  S2"C(q)  (6) 

where  ,S’(q)  is  a  nonlinear,  first-order  differential  operator. 


2.2.  Conservative  Form  (Set  2C) 

Eq.  (1)  may  be  written  in  flux,  or  conservative  form 

dp 

t£  +  V-U  =  0 

ot 
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complemented  by  the  equation  of  state 

fRoy 
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\Pa) 


(7a) 

(7b) 
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with  the  prognostic  variables  defined  as  (p,  Ur,  0)r,  where  p  is  density,  U  = 
pn  is  momentum,  u  =  ( u,v,w)T  is  velocity,  and  0  =  pO  is  density  potential 
temperature.  In  addition,  the  operator  ®  denotes  the  tensor  product  and  I3 
is  the  rank-3  identity  matrix.  The  variables  in  these  equations  are  split  in 
a  similar  fashion  to  Eq.  (1)  except  that  we  now  split  the  density  potential 
temperature  as  follows 


0(x,  y,  z,  t )  =  0o(z)  +  0\x ,  y ,  z,  t). 
Applying  the  hydrostatic  decomposition  to  Eq.  (7)  yields 

dp' 


dt 


+  V-U  =  0 
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U  0  u  ,  \ 

— —  +  P' I3  +  p'gk  +  f  x  U  =  0 
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(10b) 


dt 

which  is  written  in  vector  form  as 


<90'  „ 

^— +  V- 


(10c) 


^  +  V  ■  F(q)  =  5(q) 


(11) 


where  F(q)  is  the  flux  tensor  and  the  gravitational  and  Coriolis  terms  are 
incorporated  into  <S(q).  Finally,  we  can  write  this  equation  in  the  condensed 
form  as 


(12) 


3.  Numerical  Methods 

In  this  section,  we  briefly  discuss  the  numerical  discretization  used  by 
NUMA  which  includes:  the  spatial  discretization  of  Eq.  (1)  using  a  continu¬ 
ous  Galerkin  (CG)  method  and  Eq.  (7)  using  a  discontinuous  Galerkin  (DG) 
method  (Sec.  3.1);  the  explicit  time-integrator  used  to  evolve  the  solution 
forward  in  time  (Sec.  3.2);  the  necessary  boundary  conditions  required  for 
limited-area  problems  (Sec.  3.3);  and  the  artificial  diffusion  used  to  control 
overshoots  (Sec.  3.4). 

3.1.  Spatial  Discretization 

Let  us  now  describe  the  approximation  of  the  continuous  spatial  operators 
using  both  CG  and  DG.  We  begin  with  a  description  of  the  decomposition 
of  the  global  spatial  domain  into  elements,  the  basis  functions  defined  within 
these  elements,  and  element-wise  integrals. 

3.1.1.  Basis  Functions,  Elements,  and  Integrals 

Element-based  Galerkin  methods  such  as  the  continuous  Galerkin  and 
discontinuous  Galerkin  methods  require  the  decomposition  of  the  global  do¬ 
main  f2  C  7 Z3  into  Ne  non-overlapping  elements  fle  via 


(13) 


e=l 
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In  the  current  formulation  of  NUMA,  we  let  be  hexahedra,  which  pro¬ 
vides  simple  grid  generation  and  efficient  (fast)  evaluation  of  the  necessary 
differentiation  and  integration  operators.  We  note,  however,  that  Qe  may  be 
replaced  by  tetrahedra,  pyramidal  elements,  or  triangular  prisms,  in  a  future 
version  of  NUMA;  while  this  can  be  done  for  both  CG  and  DG,  we  envision 
doing  this  only  for  the  DG  version  since  DG  does  not  require  global  matrices 
to  be  stored.  CG  would  require  the  inversion  of  a  sparse  global  mass  matrix 
which  would  be  prohibitive  in  three-dimensions.  We  also  note  that  tetrahe¬ 
dral  and/or  pyramidal  elements  would  produce  an  unstructured  grid  in  the 
vertical,  which  would  impede  the  incorporation  of  column-based  microphysics 
and  physical  parameterizations.  Finally,  sum- factorization  may  be  used  with 
hexahedral  elements,  reducing  the  complexity  of  a  3D  method  based  on  N- 
th  order  polynomials  from  O  ( Afi )  to  O  (A1),  making  EBG  methods  one  to 
three  orders  of  magnitude  more  efficient  for  typical  values  of  3  <  N  <  10. 

Letting  the  unit  cube  (£,  77,  £)  e  E  =  [—1,  l]3  be  the  reference  hexahedral 
element,  a  transformation  jFe  :  f2 e  ^  E  mapping  physical  space  to  computa¬ 
tional  space  is  defined  for  each  element,  yielding  (x,  y,  z )  =  W(C  y,  C)  where 
the  associated  Jacobian  of  Te  is  denoted  by  Je. 

Within  each  element  Qe ,  a  finite- dimensional  approximation  q  v  is  formed 
by  expanding  q(x,  t)  in  basis  functions  ipj  (x)  such  that 


mn 

qjv}(x,  t)  =  Y^  ^(x)q fit)  (14) 

3= 1 

where  Mjv  =  (N  +  l)3  is  the  number  of  nodes  per  element,  N  is  the  order 
of  the  basis  functions,  and  the  superscript  (e)  denotes  element-wise  (local) 
values.  For  basis  functions,  we  construct  tensor  products  of  Lagrange  poly¬ 
nomials  given  by 

xpi  (x)  =  ha(£)  <S>  hp{rj)  ®  h7(C)  (15) 

where  ha(£)  is  the  Lagrange  polynomial  associated  with  the  Legendre-Gauss- 
Lobatto  (LGL)  points  and  (^,77,  ()  are  functions  of  the  physical  variable 
x.  These  LGL  points  satisfy 


(1  -  ap;«)  =  0 


(16) 


where  Pn{0  is  the  AT-th  order  Legendre  polynomial.  We  can  now  use  all  of 
this  machinery  defined  to  construct  discrete  approximations  of  the  continuous 


spatial  derivatives.  For  example,  the  derivative  of  q  can  be  obtained  by 
differentiating  Eq.  (14)  as  follows 

mn 

Vqji(x,  t)  =  ^2  V^x(x)qJe)(t)  (17) 

3= 1 

which,  after  multiplying  by  the  test  (basis)  function  ^  and  integrating  within 
an  element  yields 

/  'i/ji'^2V'ijjjx(-x)c^e\t)dne  (18) 

”  j  —  l 

where  we  shall  use  Nth-degree  LGL  integration  points.  We  will  make  use  of 
these  definitions  in  order  to  define  the  CG  and  DG  approximations. 


3.1.2.  Continuous  Galerkin 

For  CG  we  shall  use  Eq.  (1)  as  our  governing  equations.  Using  Eq.  (14) 
to  approximate  q;v,  multiplying  by  a  test  function  and  integrating  as  in  Eq. 
(18)  yields  the  element-wise  definition  of  the  problem 

[  =  f  ^iS2NC  {(\n)  dQ.e  (19) 

Jae  ot  Jne 

and  applying  the  global  assembly,  or  direct  stiffness  summation  (DSS)  oper¬ 
ator  required  by  all  CG  methods  yields  the  weak  formulation  for  CG:  find 
q,\  G  VGG  such  that 

f  f  i>iS2NC  {Qn)  dGL  V^eV^G  (20) 

Jn  ut  Jn 

where 

Vcng  =  G  H\Q)  :  iP  G  PN(I),  e  =  1, . . . ,  Ne}  (21) 

and  PN  denotes  the  space  of  all  polynomials  of  degree  N.  Notice  that  the 
requirement  'ip  G  Hl  (Q)  implies  V$G  C  C'°(G).  The  DSS  operator  A^i 
forms  global  matrices  from  the  element  (local)  matrices.  For  example,  for 
the  local  mass  matrix  defined  as 

M$=  [  A^dQe  (22) 

Jne 
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the  DSS  operator  produces  the  global  mass  matrix 

Ne  Ne  , 

Mtj  =/\M^  =  /\  dtte  (23) 

e=l  e=l 

where  the  DSS  sums  the  contribution  of  all  the  elements  e  =  1, Ne  and 
i  =  1 , ,Mpj  and  stores  them  in  the  global  grid  points  I  =  1, . . . ,  Np  as 
follows  (i,e)  —*■  I.  Extracting  the  global  mass  matrix  from  Eq.  (19)  and 
writing  the  operator  S'  as  a  vector,  results  in  the  following  compact  matrix- 
vector  form 

^  =  MT}SSC(  q)  (24) 

where  the  mass  matrix  Mu  =  jQ  ipiipj  dfl  is  diagonal  because  the  interpola¬ 
tion  and  integration  points  have  been  carefully  chosen  to  be  co-located;  this 
approximation  is  valid  for  IV  >  4  while  incurring  a  small  error  in  integration 
[13].  Denoting  the  right-hand  side  (RHS)  of  Eq.  (24)  by  Ri( q),  Eq.  (24)  is 
expressed  as 

o  Ne 

=  A  (qf)  •  (25) 

e=l 

Note  that  the  DSS  operator  maps  local,  element-wise  coordinates  (i,e)  to 
global  coordinates  I.  Eq.  (25)  forms  the  core  of  the  spectral  element  method, 
allowing  local,  element-wise  information  qf1  to  propagate  to  adjacent  ele¬ 
ments  via  the  DSS  operator. 


3.1.3.  Discontinuous  Galerkin 

For  DG  we  use  Eq.  (7)  because  the  DG  method  requires  the  equations 
to  be  in  conservation  form.  We  have  also  developed  a  CG  code  with  Set  2C 
but  we  use  Set  2NC  for  CG  because  it  represents  the  optimal  equation  set 
for  our  needs  as  described  in  [15]. 

Beginning  with  Eq.  (11),  using  the  basis  function  expansion,  Eq.  (14), 
multiplying  by  a  test  function,  and  integrating  element-wise  yields 

/  A^-dQe+  [  ipiV  ■  F('e\qN)dDe  =  f  ^ e)(qN)dDe .  (26) 

JQe  ut  Jne 

Integrating  the  divergence  of  the  flux  by  parts  yields  the  weak  formulation 
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for  DG:  find  E  V^G 


r  f)  (e)  N faces  „  „ 

/  +J2  An{e’k) -F^k\qN)dre- 

Jne  ot  JVe  Jne 


'Me 


^s(e)(q N)dne  V^GV 


DG 

N 


V*  ■  F(e)(qjv)<ia 
(27) 


where  the  DG  finite-dimensional  space  is  defined  as 
V§G  ={</>£  L2(Q)  :  v  6  PN(I)>  e  =  l 


(28) 


Notice  that,  contrary  to  Eq.  (21),  there  is  no  global  continuity  requirement, 
so  that  VGG  <£.  C'°(0).  This  is  possible  because  in  Eq.  (27)  differentiability 
is  required  separately  within  each  element,  and  not  within  the  entire  domain 
Q.  The  coupling  between  neighboring  elements  is  then  recovered  through 
the  numerical  flux  (or  Riemann  solver)  F^e,k\  which  is  required  to  be  a  single 
valued  function  on  the  inter-element  boundaries  and  the  precise  definition  of 
which  is  given  below. 

For  simplicity  we  use  the  Rusanov  flux  defined  as 

F(e,fc)  =  1  [F(e)  +  F(fe)  _  n {e'k)cmax  (q(fc)  -  q(e))]  (29) 

where  cmax  is  the  maximum  wave  speed  of  the  Euler  equations  (i.e.,  maximum 
eigenvalue  of  the  Jacobian  matrix)  which  turns  out  to  be  cmax  =\  n  •  u  |  +a 
where  a  denotes  the  speed  of  sound.  Note  that  other  types  of  numerical 
fluxes  are  possible  including  multi- dimensional  Riemann  solvers  (e.g.,  [9]). 
We  have  initially  chosen  this  simple  Rusanov  flux  because  it  vastly  simplifies 
the  parallel  strategy.  Using  a  one-dimensional  flux,  as  we  have,  reduces 
the  element  communication  to  edge  neighbors  but  restricts  the  maximum 
allowable  time-step.  Conversely,  using  a  multi- dimensional  Riemann  solver 
will  increase  the  maximum  allowable  time-step  but  will  also  increase  the  size 
of  the  communication  stencil.  The  vices  and  virtues  of  this  approach  need  to 
be  studied  in  detail.  Such  a  study  is  beyond  the  scope  of  the  present  work; 
we  leave  this  topic  for  future  work. 


3.2.  Explicit  RK  methods 

We  implemented  a  strong  stability  preserving  (SSP)  Runge-Kutta  third- 
order,  five  stage  time-integrator  (RK35)  proposed  in  [37] .  This  time- integrator 


If 


is  stable  for  (acoustic)  Courant  numbers  of  1.3  or  less  when  using  a  continu¬ 
ous  Galerkin  discretization,  whereas  for  DG,  the  time-integrator  is  stable  for 
(acoustic)  Courant  numbers  of  0.85  or  less. 

We  emphasize  that  this  explicit  RK  method  will  not  be  used  in  an  op¬ 
erational  setting  due  to  the  stringent  CFL  requirement;  rather,  we  are  de¬ 
veloping  IMEX  methods  which  allow  much  larger  time-steps.  Nevertheless, 
this  particular  time  integrator  was  chosen  for  the  following  reasons:  1)  to 
provide  high-order  accuracy  in  time  to  complement  the  high-order  spatial 
discretization,  2)  to  minimize  the  amount  of  numerical  dissipation,  3)  to  al¬ 
low  a  relatively  larger  time-step  (with  respect  to  explicit  methods),  and  4)  to 
provide  stable  solutions  for  both  CG  and  DG  methods.  Criteria  1)  and  2)  are 
necessary  for  resolving  fine-scale  flow  features  present  in  many  atmospheric 
phenomena,  whereas  criterion  3)  is  necessary  for  model  efficiency,  and  crite¬ 
rion  4)  is  required  in  order  to  facilitate  the  comparison  of  CG  and  DG.  In 
a  previous  study  involving  nonhydrostatic  modeling  [5],  all  available  third- 
order  multi-stage  methods  in  addition  to  an  assortment  of  multi-step  methods 
were  compared  and  RK35  was  found  to  be  the  most  efficient.  In  addition, 
RK35,  unlike  leapfrog,  does  not  have  a  computational  mode  and  therefore 
does  not  require  a  time  filter  (DG  also  would  not  work  with  leapfrog  due  to 
the  eigenvalues  for  DG  not  residing  along  the  imaginary  axis  as  they  do  for 
CG).  Note,  however,  that  within  a  parallel  setting,  each  RK  stage  requires 
parallel  communication,  thus  making  this  choice  of  time-integrator  relatively 
expensive.  However,  this  extra  expense  is  tolerated  in  order  to  maintain 
high-order  accuracy  in  time.  It  should  also  be  mentioned  that  we  have  cho¬ 
sen  an  explicit  method  in  order  to  simplify  and  focus  the  discussion  of  CG 
and  DG  with  respect  to  scalability.  The  optimal  scalability  will  be  achieved 
by  an  explicit  method.  A  study  on  the  scalability  of  IMEX  methods  requires 
a  discussion  of  a  number  of  topics  including:  implicit  time-integrators,  the 
choice  of  iterative  solver,  and  specific  preconditioning  strategies;  we  leave 
this  for  future  work. 

3.3.  Boundary  Conditions 

Limited-area  atmospheric  models,  such  as  mesoscale  codes,  typically  re¬ 
quire  two  types  of  boundary  conditions:  no-flux  boundary  conditions  (NF- 
BCs),  that  mimic  impenetrable  objects  (e.g.,  the  ground)  and  non- reflecting 
boundary  conditions  (NRBCs),  that  mimic  an  infinite  domain  (e.g.  the  top 
of  the  atmosphere)  by  allowing  waves  to  propagate  out  of  the  computational 
domain  without  generating  spurious  reflections.  In  this  section,  we  outline 
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how  NFBCs  and  NRBCs  are  imposed  for  both  CG  and  DG.  For  additional 
details,  see  [14,  15]. 

3.3.1.  No-Flux  Boundary  Conditions 

All  of  our  test  cases  utilize  NFBCs  on  the  bottom  boundary,  while  some 
test  cases  (such  as  the  rising  thermal  bubble)  utilize  NFBCs  on  other  bound¬ 
aries  as  well.  For  CG,  we  enforce 


n  •  u  =  0  (30) 

for  all  points  on  the  boundary  F,  where  n  is  the  outward  pointing  unit  normal 
vector  on  T.  In  order  to  apply  the  NFBC  to  the  prognostic  vector  q,  we 
augment  n  to  R5  via  n  =  (0,  nr,  0)r,  yielding  n  •  q  =  0.  To  apply  theses 
boundary  conditions  in  the  strong  sense,  we  construct  a  3  by  3  projection 
matrix  P  via 

(1  ~n2x  -nxny  -nxnz  \ 

-nynx  1  -  riy  -nynz  .  (31) 

-nznx  -nzny  1  -  n2z  J 

This  matrix  is  constructed  during  the  initialization  phase  and  applied  to  the 
RHS  operator  after  each  time- step. 

On  the  other  hand,  for  DG  the  NFBCs  are  imposed  via  the  numerical  flux 
as  follows.  On  boundary  edges  where  NFBCs  are  to  be  imposed,  a  ghost-cell 
is  constructed  on  the  other  side  of  the  wall  that  has  a  velocity  vector  the 
negative  of  the  interior  element.  In  other  words, 

Xj(fc)  —  _  u(e) 

where  k  in  this  case  denotes  the  edge  that  is  positioned  on  a  no- flux  boundary; 
note  that  this  ensures  that  the  no-flux  condition  n  •  U  is  satisfied.  The  scalar 
variables  are  copied  directly  such  that  p(k)  =  f/e'> ,  etc. 

3.3.2.  Non- Reflecting  Boundary  Conditions 

In  an  operational  mesoscale  NWP  model,  the  four  lateral  and  the  top 
boundaries  should  mimic  an  open  domain.  That  is,  waves  should  smoothly 
exit  the  domain  without  reflection;  in  addition,  information  from  outside  the 
domain  should  be  allowed  to  enter  the  domain  of  interest.  Mathematically 
modeling  this  behavior  is  non-trivial  and  has  attracted  the  attention  of  re¬ 
searchers  in  many  disciplines  [4,  18,  27].  In  our  model,  we  utilize  a  simple, 
albeit,  effective  absorbing  sponge  layer  method.  The  computational  domain 
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is  surrounded  by  a  layer  with  Newtonian  relaxation  coefficients  «(x)  and 
/5(x)  such  that  a  =  1  and  (3  =  0  in  the  domain  of  interest,  while  a  — >  0  and 
f3  — »  1  as  the  boundary  is  approached.  Specifically,  for  the  top  boundary,  we 
choose 

/  \  4 


P  = 


z  -  zx 


Zt 


(32) 


where  zt  is  the  vertical  height  of  the  domain  and  zs  is  the  bottom  of  the  sponge 
layer.  Similar  functions  are  used  for  the  lateral  boundaries  of  the  domain. 
Once  the  sponge  layer  is  constructed,  the  numerical  solution  q  given  by  the 
RHS  operator  is  relaxed  to  some  known  solution  at  the  boundary  q/,  via 


q  =  oixjq  I  /f(xlq,,. 


(33) 


For  problems  under  consideration  in  this  paper,  q/,  is  a  far- field  condition 
(e.g.  known  wind  velocity  and  zero  perturbation  of  scalars). 


3-4-  Artificial  Diffusion 

To  add  a  certain  level  of  numerical  dissipation  we  have  implemented  sec¬ 
ond  order  artificial  diffusion  operators,  although  hyper- viscosity  is  also  pos¬ 
sible.  For  Eq.  (6)  we  add  the  following  right-hand-side  operator 

(  0 

zA7-(Vu) 

\  iA7  •  (W) 

and  for  Eq.  (12)  we  add  the  operator 

(  0 

I  uV  •  (pVu) 

V  vV  ■  (pW) 

Note  that  the  artificial  diffusion  operator  applied  to  the  potential  tempera¬ 
ture  equation  must  act  only  on  the  potential  temperature  perturbation,  O', 
because  if  it  is  applied  to  the  full  potential  temperature  variable  then  it  will 
smoothen  the  background  reference  field  which  must  remain  intact  in  order 
to  maintain  hydrostatic  balance. 


14 


4.  Parallel  Implementations  of  EBG  Methods 

NUMA  is  an  MPI-based  code  targeted  toward  distributed  memory  archi¬ 
tectures  (e.g.  clusters).  In  this  section,  we  discuss  the  parallel  implementa¬ 
tion  of  NUMA,  including:  a  description  of  the  domain-decomposition  (Sec. 
4.1),  communication  strategies  for  both  CG  and  DG  (Secs.  4.2  and  4.3),  a 
comparison  of  these  communication  volumes  (Sec.  4.4),  and  a  discussion  of 
memory  access  of  these  EBG  methods  (Sec.  4.5). 

4-1-  Domain  Decomposition 

We  decompose  fl  into  Np  processor  elements  (PE)  flp  that  consist  of  local 
elements  Fif).  Mathematically,  we  rewrite  Eq.  (13)  as 

NP  N(ep) 

«  =  U  U  (34> 

p= 1  e'=l 

where  Ne ^  is  the  number  of  local  elements  residing  on  PE  p.  Since  the 
DSS  operator  for  CG  requires  global  elements  e  and  the  flux  integrals  for 
DG  require  global  faces  /,  we  must  also  construct  local  to  global  mappings 
for  both  elements  e  =  LGE^p\e')  and  faces  /  =  LGF(p)  (/')  that  map  local 
elements  e'  and  faces  f  on  processor  p  to  global  elements  e  and  faces  / 
residing  on  the  global  domain  Q. 

A  guiding  principle  in  the  construction  of  NUMA  is  to  maintain  inde¬ 
pendence  between  grid  generation  and  the  CG/DG  solver;  hence,  domain 
decomposition  should  be  as  general  as  possible  and  not  be  constrained  by 
the  choice  of  grid.  Therefore,  we  have  implemented  a  domain  decomposition 
strategy  based  on  the  widely  used  METIS  graph  partitioning  library  [21]. 
METIS  requires  an  adjacency  graph  where  the  Ne  vertices  are  elements  fle 
and  the  “edges”  denote  the  connectivity  between  elements.  Since  the  DSS 
operator  requires  information  from  elements  that  share  nodes,  the  “edges” 
include  all  forms  of  geometric  connectivity  (faces,  edges,  and  vertices).  For 
maximum  flexibility,  we  construct  a  weighted  adjacency  graph  G'  =  (U,  E) 
with  adjacency  matrix  A’  of  size  Ne  by  Ne  defined  via 

{v  if  i  and  j  are  vertex  neighbors 
e  if  i  and  j  are  edge  neighbors  (35) 

1  if  i  and  j  are  faces  neighbors 
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where  v  and  e  are  arbitrary  weights.  Although  not  explored  in  this  paper, 
the  values  of  v  and  e  may  be  chosen  to  construct  a  machine-optimal  weighted 
adjacency  matrix.  Once  A'  is  constructed  the  adjacency  matrix  A  for  the 
graph  G  =  ( VI  F),  where  F  are  geometrical  faces,  is  simply  ai7  =  1  if  ah  =  1 
and  ctjj  =  0  otherwise.  Two  example  connectivity  graphs  for  a  2D  grid 
are  shown  in  Figure  1,  showing  both  edge  and  vertex  connectivity.  The 
associated  9  by  9  adjacency  matrix  has  nodes  with  degree  ranging  from  3  (for 
the  corner  nodes)  to  8  (for  the  central  node).  In  contrast,  the  flux  integrals 
only  require  face  adjacency  information;  hence,  our  DG  code  only  utilizes 
the  standard  adjacency  matrix  A.  Also  shown  in  Fig.  1  is  the  adjacency 
matrix  associated  with  DG,  which  only  shows  edge  (face)  adjacency.  These 
adjacency  matrices,  along  with  the  number  of  processors  Np  are  then  passed 
to  METIS,  which  returns  a  partition  V  :  V  —>{1,2 ,  ...,Npj,  that  maps  global 
elements  to  processors.  This  mapping  is  then  used  to  construct  the  local  to 
global  mappings  LG  necessary  for  global  assembly  in  Eq.  (25). 

In  addition  to  the  element  adjacency  graph  G,  a  processor-element  adja¬ 
cency  graph  Gp  =  ( Vp,Ep )  is  constructed,  where  the  vertices  Vp  are  pro¬ 
cessor  elements  and  the  edges  Ep  are  the  connectivity  between  processor 
elements.  For  a  CG  method,  EP  includes  vertex,  edge,  and  face  connec¬ 
tions  between  processor  elements.  The  Np  by  Np  adjacency  matrix  A(P'1 
associated  with  Gp  is  derived  from  A '  as  follows:  the  element  a[p  =  1  if 
the  intersection  of  all  rows  i!  and  columns  j'  of  A'  such  that  i  =  V(ir)  and 
j  =  V(j')  has  at  least  one  nonzero  element;  otherwise,  a\p  =  0.  From  the 
inter-processor  adjacency  matrix  A(I>) ,  the  necessary  communication  data 
structures  are  constructed.  Specifically,  the  neighbors  of  processor  element  i 
are  simply  the  non-zero  columns  of  row  i.  while  the  number  of  neighbors  for 
element  i  is  given  by  zCj=Ti  aiP  ■  Thus,  a  low-communication  partition  will 
have  an  adjacency  matrix  Ail>)  that  is  as  sparse  as  possible. 

4-2.  Parallelization:  CG 

We  first  consider  the  parallelization  of  CG.  The  global  DSS  operator  in 
Eq.  (25)  requires  inter-processor  communication  due  to  the  G°  requirement 
of  elements  at  processor  element  boundaries.  A  global  DSS  is  required  in  two 
parts  of  the  code:  construction  of  the  mass  matrix  and  construction  of  the 
right-hand  side  (RHS).  To  perform  this  communication,  we  first  construct  the 
boundary  nodes  of  each  processor  (excluding  the  physical  boundary  where 
BCs  are  applied).  Denote  the  boundary  of  processor  element  i  by  .  In 
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(a)  EBG  Grid 
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(b)  Generalized  Adjacency  (c)  Adjacency  Graph 
Graph  (CG)  (DG) 


Figure  1:  Example  2D  grid  (left),  the  CG  associated  adjacency  graph  (center),  and  DG 
adjacency  graph  (right).  Since  CG  methods  utilize  nodal  communication,  the  generalized 
adjacency  graph  includes  both  edge  and  vertex  connectivity,  with  a  maximum  degree  of 
eight.  For  3D,  structured,  Cartesian  grids,  the  maximum  degree  is  26.  In  contrast,  typical 
DG  methods  utilize  face-based  communication  and  hence  only  require  the  adjacency  graph 
shown  in  the  right  panel.  For  3D  Cartesian  grids  the  maximum  degree  for  DG  is  only  6. 


order  to  communicate  between  processor  elements,  we  construct  two  data 
structures  send  and  recv.  Consider  a  particular  processor  i  and  neighboring 
processor  j.  send  contains  the  local  nodes  on  processor  i  that  are  sent  to  each 
neighboring  processor  j,  while  recv  contains  the  local  nodes  on  processor  j 
that  must  be  sent  to  i.  In  order  to  construct  send,  we  utilize  the  method 
shown  in  Algorithm  1.  The  corresponding  recv  structure  contains  the  same 
grid  points  as  send,  although  they  may  be  ordered  differently. 


Algorithm  1  Construction  of  MPI  send/receive  communication  data  struc- 
tures  for  both  CG  and  DG. _ 

for  all  NBHs  j  of  i  do 

MPISEND  dQf  <-  LGE «  (dQf  to  proc  j 
MPIRECV  dQf  <-  LGE®  (dQf)  to  proc  i 
B  <-  dQf  n  dQf 

L  J 

send(j)  <-  [ LGE^]~l(B ) 

end  for 


Once  these  data  structures  have  been  constructed,  the  global  mass  matrix 
and  RHS  operators  may  be  constructed  via  a  two  step  process.  The  global 
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mass  matrix  Mu  and  RHS  operator  in  Eq.  (25)  are  decomposed  as 


Ne  Np  jvip) 

M„  =  /\  M\f  =  A  A  M‘i]  and 

e=l  np=l  e'=l 

(36a) 

Ne  Np  N(ep) 

\> 

'of 

< 

< 

II 

'of 

< 

II 

05 

(36b) 

e—1  np=l  e'—\ 


Hence,  the  global  DSS  is  decomposed  into  a  local,  on-processor  DSS  and  a 
global  DSS,  that  requires  inter-processor  communication.  In  this  way,  global 
continuity  between  elements  is  preserved  during  the  construction  of  each 
RHS.  Note  that  the  communication  stencil  is  simply  the  boundary  of  each 
processor  element  and  is  independent  of  the  polynomial  order  N;  that 
is,  unlike  higher-order  finite  difference  or  finite  volume  methods,  the  spectral- 
element  method  is  halo-free.  We  note  however,  that  the  message  size  grows 
with  the  surface  area  of  the  processor  element,  or  N2.  In  summary,  the  global 
DSS  procedure  may  be  summarized  as  follows:  The  global  DSS  operation  is 


Algorithm  2  The  Direct  Stiffness  Summation  (DSS)  Operation  in  CG 

1.  Perform  a  local  DSS  on  processor  i. 

2.  Exchange  boundary  points  between  processors  i  and  all  processor  neigh¬ 
bors  j  using  (isynchronous)  isend  and  irecv. 

3.  Perform  a  global  DSS  using  the  boundary  data  received  from  neighbors 

j- 


represented  schematically  in  the  left  panel  of  Figure  2  in  a  2D  setting.  To 
simplify  the  discussion,  each  processor  is  assumed  to  own  one  element.  In 
order  to  construct  the  RHS  operator  on  the  boundary  (red  dots),  the  ele¬ 
ment  E  (green)  requires  nodal  information  for  its  8  nodal  neighbors.  These 
neighbors  include  both  edge  neighbors  (2,  4,  6,  and  8)  and  vertex  neighbors 
(1,  3,  5,  and  7).  In  a  3D  setting,  a  hexahedral  element  may  have  (a  maxi¬ 
mum)  of  6  face  neighbors,  12  edge  neighbors,  and  8  vertex  neighbors,  for  a 
total  of  26  nodal  neighbors.  Note  that  for  tetrahedral  grids,  the  number  of 
vertex  neighbors  may  be  much  greater.  To  reduce  this  communication  cost, 
METIS  is  used  to  reduce  the  total  number  of  vertex  neighbors;  in  a  typical 
3D  partition,  a  processor  element  lying  away  from  a  physical  boundary  has 
16  or  fewer  nodal  neighbors;  hence,  the  METIS-based  decomposition  further 
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(b)  DG  stencil 


Figure  2:  Computational  stencils  for  a)  CG  and  b)  DG  in  a  2D  setting,  where  each 
processor  element  owns  one  element.  In  the  left  panel,  the  element  E  (green)  requires 
nodal  information  for  its  8  nodal  neighbors  in  order  to  construct  the  DSS  operator  on 
the  boundary  (red  dots).  In  the  right  panel,  the  same  element  E  only  requires  nodal 
information  for  its  4  face  neighbors  in  order  to  construct  flux  integrals.  In  both  methods, 
the  interior  nodes  (yellow  dots)  do  not  need  to  be  communicated  to  adjacent  processors, 
resulting  in  a  halo- free  algorithm. 


reduces  communication  cost  relative  to  a  naive  geometric  domain  decompo¬ 
sition.  In  particular,  we  use  the  routine  METIS_PartGraphRecursive .which 
reduces  the  edge-cut  of  the  graph  G. 

EBG  methods  possess  purely  local  communication  stencils.  Referring 
to  the  left  panel  of  Fig.  2,  the  interior  nodes  (yellow  dots)  do  not  need  to 
be  communicated  to  adjacent  processors,  resulting  in  a  halo-free  algorithm. 
Thus  unlike  high-order  finite  difference  and  finite  volume  schemes,  spectral 
elements  may  achieve  spectral  accuracy  without  sacrificing  their  local  char¬ 
acter. 

4-3.  Parallelization:  DG 

Parallelization  of  DG  is  much  more  straightforward  than  CG.  Since  there 
are  no  global  mass  matrices,  all  that  is  required  is  a  way  to  communicate 
boundary  data  between  processor  elements.  In  addition  to  a  local/global 
element-mapping,  an  additional  local/global  face  structure  is  also  required. 
Faces  on  each  processor  element  (excluding  boundary  faces)  are  classified 
into  two  types:  intra-processor  and  inter-processor.  The  necessary  send  and 
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receive  data  structures  are  then  constructed  by  calculating  the  intersection 
of  face  boundaries  on  each  PE  as  described  in  a  manner  similar  to  Algorithm 

1.  Unlike  CG,  however,  the  DG  algorithm  requires  only  the  set  of  faces  / 
which  need  to  be  sent  to  adjacent  processors  and  not  the  actual  grid  points. 
Once  these  maps  and  data  structures  are  constructed,  the  flux  operators  are 
evaluated  using  the  following  algorithm,  which  employs  interleaved  compu¬ 
tation/communication: 


Algorithm  3  Interleaved  Computation  of  the  Flux  Operator  in  DG 

1.  Send/Receive  boundary  data  via  non-blocking  MPI  Send/Receive. 
While  waiting  for  boundary  data  do: 

2.  Compute  volume  integrals. 

3.  Compute  intra-processor  flux  integrals. 

4.  MPI.WaitO 

5.  Compute  inter-processor  flux  integrals. 

6.  Multiply  by  element-wise  inverse  mass  matrix. 


The  communication  stencil  is  illustrated  geometrically  in  the  right  panel 
of  Figure  2.  Like  CG,  DG  algorithms  do  not  require  internal  nodes  to  be 
communicated  to  adjacent  processors  and  is  hence  halo-less.  Moreover,  since 
all  communication  is  based  on  fluxes,  only  elements  which  share  adjacent 
faces  need  to  communicate.  This  compact  stencil  has  a  crucial  impact  on 
both  the  communication  volume  and  memory  access  patterns  of  DG  when 
compared  to  CG.  These  issues  are  explored  in  the  next  two  sections. 

We  note  that  the  focus  of  this  study  is  a  comparison  of  CG  and  DG  from 
an  algorithmic  point  of  view  and  that  the  parallelization  algorithms  outlined 
in  this  and  the  previous  section  are  not  optimal.  In  particular,  we  have  not 
implemented  sophisticated  scheduling  between  the  individual  MPI  processes 
using  MPI_Waitany  or  Mpi_Waitsome  ,  nor  have  we  exploited  topology  aware 
communication.  We  are  currently  addressing  these  implementation  details  in 
our  codes.  Therefore,  it  must  be  understood  that  both  methods  can  indeed 
be  further  optimized;  however,  the  results  we  present  represent  the  simplest 
approaches  for  constructing  parallel  strategies. 

4-4-  Communication  Volume 

A  simple  model  [8]  for  the  communication  time  (or  cost)  of  a  single  MPI 
process  is 

tc  =  ta(a  +  (3m )  (37) 
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where  ta  is  the  time  to  do  a  floating-point  operation,  a  is  proportional  to  the 
latency  (i.e. ,  message  startup  cost),  and  / 3  is  the  asymptotic  transfer  rate  (i.e. , 
inverse  bandwidth).  Typically,  the  latency  cost  a  is  larger  than  bandwidth 
cost  (3  by  one  to  three  orders  of  magnitude,  depending  on  the  architecture  of 
the  machine.  To  characterize  the  trade-off  between  bandwidth  and  latency, 
the  ratio  (3/a  provides  a  useful  metric;  a  typical  value  is  /3/a  =  0.01.  Using 
the  model  given  by  Eq.  (37)  the  communication  overhead  of  both  DG  and 
CG  may  be  quantitatively  analyzed. 

Consider  a  hexahedral  processor  element  (PE)  with  Ne  elements  dis¬ 
cretized  with  order  N  polynomials.  Geometrically,  we  state  that  the  length 
of  a  face,  edge,  and  vertex  message  is 


mf  =  nvarN2J\N  +  l)2 

(38a) 

me  =  nvarNl/3(N  +  1) 

(38b) 

TTiy  —  Tlvar 

(38c) 

where  nvar  =  5  is  the  number  of  state  variables. 

For  DG,  since  the  inter-processor  fluxes  only  require  face  communication, 
and  since  each  PE  has  an  average  of  six  face  neighbors,  the  total  communi¬ 
cation  cost  at  each  stage  of  the  RK  time-integrator  is 

CDG  =  taa  ^6  +  Q^nvarN2J\N  +  l)2^)  .  (39) 

Asymptotically,  we  see  that  the  communication  cost  scales  as  O  . 

In  contrast,  the  computation  volume,  which  is  dominated  by  the  construction 
of  volume  integrals,  scales  as  O  ( NeN 4);  hence,  for  large  N  and/or  large  Ne, 

the  ratio  of  communication  to  computation  O  ^AUl//3iV~2^  tends  to  zero, 
thereby  ensuring  weak  scaling. 

In  contrast,  for  CG  the  construction  of  the  DSS  operator  requires  edge 
and  vertex  communications  in  addition  to  face  communication.  Since  for  a 
hexahedral  PE,  there  are  12  edges  and  8  vertices,  the  communication  cost  is 

CCg  =  6 ta  ( a  +  (3m f )  +  12 ta  ( a  +  (3me)  +  8 ta  ( a  +  (3mv) 

=  taa  f  26  +  6—nvarN2/3(N  +  l)2  +  8—nvarN^3(N  +  1)  +  8 —nvar 
\  a  a  a 

(40) 
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As  with  DG,  the  communication  cost  scales  as  O  (Ne^3N2j  with  a  similar 
estimate  for  computation;  hence,  in  an  asymptotic  sense,  both  CG  and  DG 
have  the  same  communication  footprints.  Geometrically,  this  can  be  under¬ 
stood  since  the  surface  area  of  the  faces  of  a  PE  grow  faster  than  the  sizes  of 
edges  and  vertices.  These  asymptotic  estimates  are  not  significant  for  prac¬ 
tical  choices  of  N  and  Ne,  e.g.,  for  4  <  Ar  <  16  and  small  values  of  Ne  the 
intermediate  behavior  of  Eqs.  (39)  and  (40)  become  quite  important.  We  note 
briefly  that  the  communication  footprint  of  CG  methods  may  be  reduced  to  a 
footprint  similar  to  DG  using  the  d-directional  exchange  algorithm  proposed 
in  [10];  however,  implementation  of  this  algorithm  is  non-trivial  and  limited 
to  tensor  product  meshes,  and  is  therefore  not  explored  in  this  study. 

A  typical  ratio  of  bandwidth  to  latency  on  modern  computers  is  P/a  = 
0.01.  Consider  the  case  of  one  element  per  MPI  process,  Ne  =  1,  and  N  =  4. 
Then  Ccg/Cdg  ~  2.6.  Hence,  CG  has  a  significant  communication  overhead 
relative  to  DG  in  practice.  To  understand  this  communication  overhead  more 
extensively,  the  ratio  of  Eqs.  (40)  and  (39)  is  plotted  in  Figure  3  for  three 
different  values  of  (3/ a  and  a  range  of  polynomial  orders  N.  Several  trends 
are  evident  from  this  graph.  Firstly,  for  all  bandwidth/latency  ratios,  the 
communication  overhead  is  greatest  for  linear  (N  =  1)  polynomials  and  de¬ 
creases  monotonically  to  one  as  N  — >  oo.  This  behavior  is  expected  from 
the  asymptotic  estimates  made  above.  However,  from  a  practical  point  of 
view,  the  machine  architecture  and  MPI  implementation,  which  is  charac¬ 
terized  by  the  ratio  P/a,  determines  the  relative  communication  footprint  of 
CG  to  DG.  As  the  latency  costs  increase  relative  to  bandwidth  costs  (P/a 
decreases),  CG  incurs  greater  communication  relative  to  DG.  Since  the  gen¬ 
eral  trend  in  HPC  is  toward  greater  bandwidth  (1/P  is  large  therefore  p  is 
small)  and  greater  latency  (large  a),  then  DG  is  expected  to  significantly 
outperform  CG  in  terms  of  communication  footprint  as  P/a  decreases. 

Note  that  for  both  CG  and  DG,  the  communication  volume  is  O  (jSf/^N2^ 

while  the  computation  volume  is  O  (NeN4).  Hence,  for  large  number  of  ele¬ 
ments  and  (relatively)  large  orders  N,  the  computation  volume  overwhelms 
the  communication  volume.  Therefore,  in  the  DG  algorithm  described  in 
Algorithm  3,  steps  2  and  3  typically  require  much  more  time  than  the  com¬ 
munication  in  step  1.  For  this  reason,  DG  is  expected  to  scale  to  massive 
numbers  of  cores  in  a  distributed  memory  architecture. 
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Figure  3:  The  Ratio  of  communication  costs  Ccg/Cdg  plotted  for  transfer  rate/latency 
ratios  /3/a  =  0.1,  0.01  and  0.001  for  polynomial  orders  TV  =  1  to  16. 


4-5.  Memory  Access 

Fetching  data  from  memory  poses  a  serious  bottleneck  in  any  HPC  ap¬ 
plication.  The  latency  associated  with  reading  data  from  main  memory  can 
adversely  affect  the  efficiency  and  scaling  of  any  HPC  application.  In  general, 
EBG  methods  exhibit  a  high  level  of  data  locality  since  the  computational 
work  is  organized  into  discrete  elements.  In  this  section,  we  explore  the  im¬ 
pact  of  data  layout  and  access  in  CG  and  DG  and  how  this  data  access  affects 
performance. 

Figure  4  shows  the  grid  numbering  used  by  a)  CG  and  b)  DG  for  Ne  =  4 
and  N  =  3.  Let  us  assume  for  the  moment  that  CG  uses  a  global  indexing 
while  DG  must  use  a  local  indexing.  This  data  layout  has  an  impact  on  the 
amount  of  data  that  must  be  fetched  from  memory  and  stored  in  fast  cache 
for  an  EBG.  In  DG,  the  construction  of  volume  integrals  is  element-local. 
Since  the  data  for  each  element  lies  in  a  contiguous  region  of  memory,  all 
of  the  data  for  each  element  may  be  fetched  from  slow  memory  loaded  into 
fast  cache  at  the  beginning  of  an  element-wise  operation.  The  only  operation 
that  requires  access  to  a  neighboring  element’s  data  is  the  construction  of 
flux  integrals.  Hence,  during  the  construction  of  a  RHS,  the  number  of  cache 
misses  is  small. 

For  CG,  the  state  vector  is  typically  stored  in  a  continuous  block  of  mem¬ 
ory,  rather  than  in  an  element-wise  fashion  (it  is  possible  to  use  DG  storage 
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in  CG  but  here  we  discuss  only  the  standard  finite  element  numbering) .  Ob¬ 
viously,  this  data  layout  uses  less  memory  than  the  element-wise  DG  layout; 
however,  a  penalty  in  performance  may  be  incurred.  To  construct  a  RHS, 
first  data  must  be  fetched  from  the  global  state  vector  and  stored  in  a  lo¬ 
cal  array.  Cache  misses  may  occur  at  all  points  on  the  boundary  of  the 
element,  especially  for  large  grids,  where  there  is  a  large  stride  in  index¬ 
ing.  Element-wise  operations  are  then  performed,  followed  by  DSS,  which 
requires  updating  the  global  state  vector,  thus  resulting  in  additional  cache 
misses.  These  cache  misses  are  exacerbated  in  3D,  where  a  typical  element 
requires  data  from  26  adjacent  elements.  Although  these  cache  misses  may 
be  partially  alleviated  by  grid  renumbering,  a  CG  method  can  never  achieve 
the  natural  data  locality  of  a  DG  method.  Even  if  CG  uses  DG  storage,  the 
cache  misses  associated  with  the  DSS  operator  cannot  be  circumvented;  the 
DSS  operator  requires  indirect  memory  access  that  requires  striding  through 
the  memory  in  order  to  enforce  C°  continuity. 

For  optimal  performance,  the  entire  on-processor  problem  should  fit  in 
cache.  In  this  situation,  which  only  occurs  for  small  test  problems  and/or 
large  processor  counts,  no  cache  misses  will  happen.  A  rough  estimate  for  the 
size  of  the  on-processor  problem  is  made  for  both  DG  and  CG.  To  construct 
a  RHS  requires  the  following  data  structures:  1)  a  state  vector,  2)  reference 
vector,  3)  RHS  vector,  4)  the  inverse  mass  “matrix”,  and  5)  nine  metric 
terms  and  two  Jacobian  matrices  (for  both  volume  and  flux  integrals).  The 
data  structures  in  items  l)-3)  each  have  size  nvarNe(N  +  l)3  with  nvar  =  5, 
whereas  structures  4)  and  5)  have  size  Ne(N  +  l)3,  thus  yielding  26Ne(N  + 
l)3  real  numbers.  A  typical  64- bit  machine  uses  8  bytes  to  store  a  real, 
yielding  a  problem  size  of  208Ae(A^+l)3.  The  memory  requirements  for  CG  is 
slightly  smaller,  since  data  is  stored  using  a  global,  rather  than  element-based 
index;  however  the  magnitude  is  similar.  On  typical  clusters  (e.g.,  TACC’s 
Ranger),  each  core  (and  hence  each  MPI  process)  has  a  cache  with  size  512 
KB,  implying  that  a  single  element  can  fit  into  cache  provided  AI  <  12.  Using 
this  estimate,  approximately  20  continuous  elements  fit  into  cache  for  A”  =  4, 
while  about  3  elements  fit  into  cache  for  N  =  8. 

From  these  estimates,  the  number  of  cache  misses  for  both  CG  and  DG 
may  be  estimated.  For  DG,  the  number  of  cache  misses  may  vary  from  0  to  6. 
For  a  very  large  problem  or  a  small  number  of  MPI  processes,  a  cache  miss  will 
occur  during  each  flux  construction,  when  data  from  an  adjacent  processor 
is  accessed,  yielding  6  misses.  For  intermediate  sizes,  a  cache  miss  does 
not  occur  when  fetching  data  from  an  element  with  continuous  numbering, 
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Figure  4:  Grid  numbering  for  a)  CG  with  global  indexing  and  b)  DG  with  local  indexing 
for  Ne  =  4  and  N  =  3. 


yielding  2  or  4  misses;  finally,  for  very  small  problems,  all  the  data  is  in  cache, 
yielding  no  misses.  For  CG,  the  number  of  misses  ranges  from  0  to  52.  For 
large  problems,  reading  and  writing  any  data  from  an  adjacent  element  will 
incur  a  cache  miss,  yielding  2*26  =  52  misses.  For  smaller  problems,  this 
count  may  be  reduced  to  48  or  fewer  misses,  depending  on  the  configuration 
of  the  sub-domain.  Excluding  sub-domains  small  enough  to  fit  into  cache, 
the  number  of  cache  misses  associated  with  CG  will  be  many  times  larger 
than  the  misses  associated  with  DG. 


5.  Results 

5.1.  Test  Cases 

Although  a  standard  set  of  2D  mesoscale  test  cases  has  been  proposed 
[33]  and  later  used  within  an  element-based  Galerkin  framework  [14],  an 
analogous  suite  of  3D  test  cases  has  not  yet  been  developed.  Fortunately,  a 
3D  dynamical  core  can  be  run  in  2D  mode  by  imposing  symmetry  in  the  y- 
dimension  and  periodic  boundary  conditions  in  the  ?/- lateral  boundaries.  For 
initial  verification  of  NUMA,  we  used  the  test  cases  proposed  in  [33] ;  later, 
we  ran  full  3D  test  cases  with  no  ^/-symmetry.  In  the  following  analysis,  we 
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consider  two  test  cases:  a  3D  linear  hydrostatic  mountain  and  a  3D  rising 
thermal  bubble. 


5.1.1.  3D  Linear  Hydrostatic  Mountain 

To  test  the  3D  capabilities  of  NUMA  and  the  NRBCs,  we  consider  strat¬ 
ified  flow  past  an  isolated  mountain  as  outlined  in  [34] .  An  initial  horizontal 
flow  U  =  20  m/s  goes  past  a  mountain  with  orography  given  by 

h(x>  V)  =  i  2  /  2  ,  17  2  ,  i  W2  (41) 

(. x2  ja 2  +  y2 /a2,  +  1  )6'2 

with  mountain  half-width  a  =  10  km  and  height  h0  =  1  m.  The  hydro¬ 
static  background  is  specified  by  a  constant  Brunt- Vaisala  frequency  Nbv  = 
gj -J cpT0  with  ground  temperature  T0  =  250  K.  In  other  words,  we  con¬ 
sider  an  isothermal  atmosphere.  No  flux  boundary  conditions  are  imposed 
on  the  bottom  of  the  domain,  while  NRBCs  are  imposed  on  the  four  lateral 
boundaries  and  the  top  boundary.  In  order  to  verify  our  numerical  results, 
several  analytical  results  from  [34]  and  [35]  are  used.  First,  a  contour  inte¬ 
gral  solution  for  the  density  perturbations  p'  valid  under  a  linear  Boussinesq 
approximation  is  utilized.  Also,  the  velocity  perturbations  parallel  and  per¬ 
pendicular  to  the  flow  (V  and  v')  are  known  for  observation  points  near  the 
ground  (z  =  0)  (see  Eqs.  (39)  and  (41)  in  [34]): 


^  (d  V )  A)  // Llbv 

v'(x,  y,  0)  =  hNbv 
under  the  same  approximation. 


x/a 

1  +  x2/a2  +  y2  /  a2 

y/g 

1  +  x2 / a2  +  y2/a2 


(42a) 

(42b) 


5.1.2.  3D  Rising  Thermal  Bubble 

We  consider  a  3D  buoyant  thermal  bubble  rising  in  a  neutrally  stratified 
atmosphere  [38],  which  is  the  3D  extension  of  the  2D  thermal  bubble  origi¬ 
nally  considered  in  [36].  The  hydrostatic  potential  temperature  9q(z)  =  300 
K  (neutral  atmosphere)  is  perturbed  with  a  sphere  of  radius  rc  =  250  m 
centered  at  ( xc ,  yc,  zc )  =  (500,  500,  260)  m  by  a  cosine  taper  given  by 


9'  =  A 


1  +  cos 


(43) 
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Figure  5:  Comparison  of  the  density  perturbations  p ’  for  the  isolated  linear  hydrostatic 
mountain  using  1)  NUMA  (solid  line)  and  2)  a  contour  integral  solution  (dashed  line)  [35]. 
In  panel  a),  the  //  contours  in  the  plane  y  =  120000  m  are  shown  (the  analytical  model 
are  solid  while  the  CG  model  are  dashed  -  the  DG  results  are  similar  to  those  for  CG  and 
are  not  shown).  In  panel  b),  p ’  is  shown  as  a  function  of  z  using  x  =  y  =  120000  m  using 
numerical  results  produced  by  the  CG  and  DG  models,  and  Smith’s  analytical  model. 

where  r  =  ||x  —  xc||2  where  ||  •  ||2  denotes  the  2-norm,  and  A  =  |  is  a  constant. 
Unlike  the  test  case  used  in  [38],  our  problem  has  a  C 1  initial  condition,  thus 
mitigating  unphysical  oscillations  at  the  interface  of  the  bubble.  The  domain 
is  defined  as  (x,  y,  z)  E  [0, 1000]3  m.  No-flux  boundary  conditions  are  applied 
on  all  six  boundaries. 

5.2.  Numerical  Verification 

The  numerical  verification  proceeds  in  three  steps.  In  phase  one,  we  ran 
the  code  in  pseudo-2D  mode  using  1  element  and  periodic  boundary  condi¬ 
tions  in  the  ^-direction.  The  numerical  results  for  the  standard  mesoscale 
suite  [33]  are  directly  compared  to  the  results  of  our  existing  2D  model  [14]. 
In  phase  two,  we  considered  the  linear  isolated  mountain  problem,  which  pos¬ 
sesses  an  approximate  analytical  solution  in  the  form  of  a  contour  integral. 
In  phase  three,  we  consider  a  three-dimensional  buoyant  convection  problem. 
Although  this  problem  does  not  have  an  analytic  solution,  its  qualitative  be¬ 
havior  is  well  understood. 
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Figure  6:  Comparison  of  the  a)  down-stream  velocity  perturbation  u'  and  b)  cross-stream 
velocity  perturbation  v'  for  the  isolated  mountain  at  ground  level.  Agreement  between  the 
CG  and  DG  models  and  the  analytical  formulas  given  by  Eq.  (42)  is  satisfactory,  especially 
for  the  cross-stream  velocity  perturbation. 

5.2.1.  3D  Linear  Hydrostatic  Mountain 

In  order  to  test  the  3D  operators,  orography,  and  non-reflecting  boundary 
conditions,  flow  over  a  3D  isolated  linear  hydrostatic  mountain  was  consid¬ 
ered.  This  problem  may  be  solved  under  the  linear  Boussinesq  approximation 
via  a  contour  integral  technique  as  described  in  [35].  In  addition,  closed  form 
expressions  for  both  the  down-stream  and  cross-stream  velocity  components 
may  be  used  to  judge  the  numerical  solution  but  only  in  a  qualitative  sense. 
We  used  20  elements  in  the  x  and  y  dimensions  and  10  in  the  z  direction  with 
eighth-order  polynomials  yielding  effective  resolutions  of  Ax  =  Ay  =1.5  km 
and  A z  =  300  m.  The  explicit  time-integrator  was  run  at  the  maximum 
allowable  time-step  for  each  method. 

Figure  5  compares  the  density  perturbations  p'  of  the  analytic  and  numer¬ 
ical  solutions.  In  panel  a),  contours  for  p’  (analytic  are  solid  while  NUMA-CG 
are  dashed)  are  shown  after  5  hours  of  simulation  time  and  compared  to  the 
contour  integral  solution;  after  5  hours  the  models  have  converged  to  steady- 
state.  The  results  of  NUMA-DG  are  similar  to  the  dashed  contours  in  Fig.  5a. 
Panel  b)  of  this  figure  compares  the  numerical  results  produced  by  NUMA- 
CG,  NUMA-DG,  and  Smith’s  analytical  model  along  the  line  x  =  y  =  120 
000  m.  Agreement  is  very  good  near  the  mountain,  while  the  two  solutions 
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begin  to  deviate  as  2  increases  due  to  the  influence  of  the  sponge.  Agreement 
between  NUMA  and  Smith’s  results  may  be  brought  into  closer  agreement 
by  increasing  the  resolution  of  the  model;  at  least  this  is  the  case  in  the  2D 
analog  of  the  problem  (see  [14]  and  [15]);  recall  that  the  “analytic”  solution 
used  here  is  for  a  Boussinesq  model  and  therefore  one  should  not  expect  to 
get  exact  agreement.  Additional  verification  was  performed  by  comparing 
the  velocity  on  the  surface  of  the  mountain.  The  down-stream  velocity  per¬ 
turbation  u'  and  cross-stream  velocity  perturbation  v'  at  the  ground  for  both 
the  CG  and  DG  models  are  compared  with  the  analytical  formulas  in  Eq. 
(42).  Figure  6  displays  the  results  of  this  comparison  after  t  =  5  hours  of 
integration.  Agreement  between  the  CG  and  DG  models  and  the  analyti¬ 
cal  formulas  given  by  Eq.  (42)  is  satisfactory,  especially  for  the  cross-stream 
velocity  perturbation.  We  attribute  the  slight  deviation  of  the  down-stream 
velocity  to  the  sponge  layer  in  the  model,  which  drives  the  total  down-stream 
velocity  to  a  constant  velocity  of  20  m/s  at  the  boundary  of  the  computa¬ 
tional  domain.  This  additional  forcing  term  affects  the  quality  of  the  solution 
near  the  sponge  layer.  In  fact,  for  this  particular  case,  the  NRBCs  dominate 
the  solution;  this  exact  same  behavior  was  seen  in  the  2D  mountain  cases  in 
the  NUMA2D  codes  presented  in  [14], 

5.2.2.  3D  Rising  Thermal  Bubble 

Finally,  the  results  of  the  buoyant  convection  experiment  are  shown  in 
Figures  7  and  8.  Numerical  results  using  both  NUMA-CG  and  NUMA-DG 
are  shown.  A  total  of  103  elements  with  N  =  8  polynomials  were  used, 
yielding  an  effective  resolution  of  12.5m.  Due  to  the  coarse  resolution  of 
this  run,  a  small  amount  of  artificial  diffusion  v  =  0.5  m2/s  was  used  to 
suppress  grid  noise.  In  the  CG  code,  a  time-step  of  At  =  0.01  s  was  used, 
whereas  the  DG  code  required  At  =  0.005.  A  smooth  potential  temperature 
perturbation  in  an  initially  neutral  atmosphere  generates  vertical  updrafts 
and  shear  velocities,  which  cause  the  bubble  to  rise,  deform,  and  transition 
to  turbulence.  Fig.  7  displays  x-z  cross-sections  of  the  potential  temperature 
perturbation  for  t  =  0,  200,  and  400  seconds,  while  Fig.  8  displays  a  ID 
cross-section  along  the  line  x  =  y  =  500  m  for  both  CG  and  DG.  Note 
that  the  CG  result  displays  significant  Gibbs  oscillations,  while  the  Gibbs 
oscillations  are  less  pronounced  in  the  DG  code;  this  behavior  is  expected, 
since  DG  is  known  to  capture  sharp  gradients  more  effectively.  The  rising 
thermal  bubble  problem  tests  the  models’  ability  to  capture  the  effects  of 
turbulence  and  turbulent  convection.  Although  no  analytical  solution  exists 


29 


0.45 


E 

N 


x(m) 

(a)  t  =  0  s  (DG) 


1000 

800 

_  600 
E 

N 

400 

200 


0  200 


400  600 

x  (m) 


0.3 

0.25 

0.2 

0.15 


800  1000 


(b)  t  =  0  s  (CG) 


0  200  400  600  800  1000 


x(m) 

(c)  t  =  200  s  (DG) 


1000' 

800 

_  600 
E 

N 

400 

200 


0  200 


400  600 

x  (m) 


0.3 

0.25 

0.2 

0.15 


800  1000 


(d)  t  =  200  s  (CG) 


Figure  7:  Evolution  of  a  3D  rising  thermal  bubble  problem  ( x  —  ^-slices  of  the  potential 
temperature  perturbation  O')  in  the  y  =  500  m  plane  for  t  =  0,  200,  and  400  seconds  using 
both  NUMA-CG  and  NUMA-DG.  A  grid  consisting  of  Ne  =  103  elements  with  N  =  8 
polynomials  was  used. 


Figure  8:  Numerical  results  for  rising  thermal  bubble  problem  along  the  line  x  =  y  =  500 
m  using  both  NUMA-CG  and  NUMA-DG. 


for  this  problem,  the  numerical  results  are  physically  plausible  and  resemble 
previous  3D  bubble  experiments  in  [38]  and  [1].  Similar  results  are  seen  for 
both  the  NUMA-CG  and  NUMA-DG  codes. 

For  this  test  case,  the  relative  conservation  of  both  mass  (M)  and  energy 
(E)  were  calculated  for  both  the  CG  and  DG  codes.  In  both  codes  M  ~  10-13 
and  E  ~  10-7.  In  other  words,  both  CG  and  DG  conserve  mass  to  a  level 
approaching  machine  precision,  whereas  energy,  which  is  not  a  conservation 
variable  in  either  model,  is  not  conserved.  Hence,  there  is  no  particular 
disadvantages  to  using  a  non- conservative  form  of  the  equations  within  the 
CG  model. 

5.3.  Parallel  Performance 

Optimized  versions  of  both  NUMA-CG  and  NUMA-DG  have  been  de¬ 
ployed  on  TACC’s  Ranger  Sun  Constellation  cluster  [2],  It  is  important  to 
understand  that  there  may  exist  more  optimal  parallel  implementations  and 
so  our  results  are  illustrative  of  two  very  specific  choices  for  constructing 
parallel  algorithms  (as  described  in  Sec.  4).  However,  the  specific  choices  we 
have  made  in  our  parallel  implementation  are  either  standard  or  simple  in 
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that  they  represent  the  most  obvious  ways  of  optimizing  both  the  CG  and 
DG  methods. 

Two  test  problems  using  the  rising  thermal  bubble  were  executed  using 
a  fixed  grid  comprised  of  323  =  32768  elements  with  both  N  =  4  and  N  =  8 
polynomials.  The  IV  =  4  simulation  has  an  effective  resolution  of  7.8  km  in 
both  the  horizontal  and  vertical,  while  for  N  =  8  the  effective  resolution  is 
3.9  km.  The  maximum  allowable  stable  time  step  was  used  for  both  the  CG 
and  DG  models  (note  that  the  CG  model  uses  a  time-step  twice  as  large  as 
DG  and  the  N  =  4  simulations  allow  a  time-step  twice  as  large  as  those  for 
N  —  8).  Due  to  limitations  on  the  available  computational  resources,  only 
one  run  at  each  processor  count  was  feasible.  Also,  due  to  the  size  of  this 
problem,  running  the  code  in  serial  was  not  feasible;  therefore,  the  wallclock 
time  for  each  code  at  16  cores  was  assumed  to  be  16  times  the  serial  wallclock 
time. 

5.3.1.  Results  for  fth  and  8th  Order  Polynomials 

Figure  9  displays  both  a)  the  wallclock  time  and  b)  speedup  for  the  CG 
and  DG  codes  using  N  =  4.  In  panel  b),  the  ideal  speedup  (perfect  scalabil¬ 
ity)  is  also  shown.  Both  codes  exhibit  nearly  perfect  linear  scaling  to  2048 
cores.  Beyond  2048  cores,  however,  the  scaling  of  the  CG  code  begins  to 
degrade  while  the  DG  code  continues  to  scale  ideally  up  to  8192  cores.  At 
16384  cores,  the  DG  code  scalability  plateaus.  Let  us  now  see  what  happens 
when  we  increase  the  polynomial  order. 

Figure  10  displays  the  same  data  for  N  =  8.  Both  the  CG  and  DG  codes 
exhibit  nearly  perfect  linear  scaling  to  512  cores.  Between  512  to  2048  the 
CG  code  loses  perfect  scaling  but  recovers  beyond  2048  cores.  Note  that 
the  DG  code  exhibits  perfect  linear  scaling  all  the  way  through  to  32768 
processors  which  is  the  maximum  number  of  processors  that  a  32s  element 
simulation  can  use  (at  this  point,  each  processor  contains  only  one  element). 

5.3.2.  Communication  Cost 

The  difference  in  the  scaling  behavior  between  the  two  codes  can  be  ex¬ 
plained  by  examining  the  relative  communication  footprints  of  both  CG  and 
DG.  In  3D,  each  CG  processor  element  (PE)  may  have  up  to  26  neighbors, 
whereas  DG  has  only  6;  hence,  NUMA-DG’s  communication  footprint  is 
more  than  4  times  as  compact  as  NUMA-CG.  In  addition,  DG  has  a  more 
local  memory  access,  since  DG  stores  data  on  an  element-by-element  basis. 
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Hence,  only  the  flux  computations  require  non-local  memory  fetches;  in  con¬ 
trast,  CG  utilizes  a  global  index,  and  requires  many  more  non-local  memory 
accesses  although  this  overhead  can  be  eliminated  if  DG-type  local  indexing 
is  used. 

Effect  of  Polynomial  Order.  Comparing  Figs.  9  and  10,  we  also  see  the  depen¬ 
dence  of  scalability  on  the  order  of  the  method.  The  number  of  grid  points  for 
a  high-order  EBG  method  is  O  (NeN3),  whereas  the  computation  volume  is 
O  (NeN4)\  hence,  the  computation  density  is  O  ( N ).  In  contrast,  the  commu¬ 
nication  volume,  as  described  in  Sec.  4.4,  grows  only  as  O  ^Ay  ^3Ar2  j  .  Hence, 

as  we  increase  the  order  N,  the  computation  density  grows  as  N  whereas  the 
communication  density  (ratio  of  communication  volume  to  the  number  of 
grid  points)  decreases  as  1/N.  From  a  computer  architecture  viewpoint,  the 
data  locality  for  both  CG  and  DG  increases  with  order,  which  is  advantageous 
for  both  distributed  memory  implementation  as  well  as  shared-memory  im¬ 
plementations  using  graphical  processor  units  (GPUs)  [25].  A  higher-order 
method,  while  more  expensive  at  low  processor  counts,  becomes  more  effi¬ 
cient  as  the  processor  count  increases. 

For  example,  at  16  MPI  processes,  the  DG  code  requires  650  seconds 
of  wallclock  time  at  A"  =  4  and  15000  seconds  at  A  =  8  (both  using  323 
elements),  yielding  a  ratio  of  23.7  for  the  wallclock  times,  i.e. , 


WTn=8 
WTn= 4 


23.7 


NPROC= 16 


where  WT  is  the  wallclock  time.  In  contrast,  at  8192  MPI  processes,  the 
same  code  requires  1.5  seconds  at  A  =  4  and  25  seconds  at  A  =  8,  for 
a  ratio  of  i?si92  =  16.6;  therefore  going  from  16  to  8192  has  decreased  the 
computational  cost  ratio  between  the  4th  and  8th  order  simulations.  We 
attribute  this  phenomena  to  the  low  communication  volume  and  increased 
data  locality  (e.g.,  greater  on- processor  work  per  grid  point)  which  is  achieved 
by  DG  as  order  A  increases.  Of  course,  the  same  arguments  holds  for  CG 
although  the  curves  are  not  as  straight  as  they  are  for  DG. 

Note  that  for  a  fixed  number  of  elements  (in  this  case  323  elements)  and 
increasing  the  polynomial  order  from  A  =  4  to  A  =  8  increases  the  size 
of  the  problem  by  a  factor  of  32:  the  number  of  grid  points  has  increased 
by  a  factor  of  8  and  the  number  of  time-steps  by  a  factor  of  4.  Although 
the  A  =  8  simulation  is  32  times  larger  than  the  A  =  4  simulation,  the 
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additional  cost  of  increasing  the  order  N  decreases  as  we  move  to  larger  core 
counts.  For  example,  the  minimum  wallclock  time  in  Fig.  9  for  N  =  4  is 
1.5  seconds  (corresponding  to  8192  processors)  while  the  minimum  wallclock 
time  in  Fig.  10  for  IV  =  8  is  7  seconds  (corresponding  to  32768  processors). 
In  this  case,  we  see  that  the  cost  ratio  between  8th  and  4th  order  is  now  4.7. 

Complexity  Analysis:  A  Geometrical  View.  One  other  way  of  interpreting  the 
results  in  Figs.  9  and  10  involves  quantifying  the  operation  count  that  takes 
place  on-processor  versus  the  size  of  the  message  that  has  to  be  communicated 
across  processors.  Geometrically,  this  can  be  viewed  as  the  ratio  between 
the  computational  volume  (volume  of  the  PE)  and  communication  footprint 
(surface  area  of  the  PE).  To  simplify  the  following  discussion  let  us  first  define 
the  computational  volume  on  a  specific  hexahedral  processor  element  (PE) 
as 

VPE  =  O  (Nenvar(N  +  l)4) 

while  the  surface  area  of  this  HPE  that  defines  the  communication  cost  is 

SPE  =  O  (N2J3nvar(N  +  l)2) 

where  Ne  denotes  the  number  of  elements  per  PE,  nvar  are  the  number  of 
variables  in  our  equations  ( nvar  =  5  in  our  case),  and  N  denotes  the  order 
of  the  polynomial;  these  relations  come  from  Eq.  (38).  Note  that  the  term 
A^3  represents  the  number  of  faces  of  a  PE  (i.e.,  the  convex  hull  of  the  PE). 
Using  these  two  relations,  we  can  now  define  the  ratio 

77 npe  =  ^  =  Q  (Ne[PE\W{N  +  l)2) 

OPE 

where  Ne[PE\  denotes  that  (for  a  fixed  grid)  the  number  of  elements  per 
processor  Ne  is  a  function  of  the  number  of  processors  used  (PE);  this  ratio 
we  now  use  to  compare  Figs.  9  and  10. 

Figure  9  shows  that  for  N  =  4  both  CG  and  DG  lose  perfect  scalability 
between  8192  and  16384  processors.  Using  a  grid  with  32s  total  elements 
we  can  determine  that  8192  processors  gives  Ne  =  4  and  for  16384  proces¬ 
sors  yields  Ne  =  2.  From  these  values  we  determine  that  77|192  =  39  and 
77f6384  =  31.  In  contrast  for  N  =  8,  778192  =  127  and  77f6384  =  101  and 
scalability  is  maintained  (Fig.  10).  This  tells  us  that  for  the  entire  range 
of  possible  number  of  processors,  for  N  =  8,  the  ratio  of  the  on-processor 
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work  to  inter- processor  communication  is  >  100  which  means  that  the  com¬ 
munication  is  overwhelmed  by  the  computation  and  we  therefore  see  perfect 
scalability.  The  N  —  4  results  tells  us  that  a  factor  of  40  or  less  is  not  suffi¬ 
cient  to  maintain  scalability.  Looking  again  at  Fig.  9  we  see  that  for  N  =  4 
we  maintain  perfect  linear  scalability  for  both  CG  and  DG  at  2048  or  fewer 
processors.  For  these  values  we  determine  that  7£|048  =  62  so  we  can  conjec¬ 
ture  that  the  factor  of  on-processor  work  to  inter-processor  communication 
can  be  as  low  as  62  while  still  maintaining  scalability. 

5.3.3.  Memory  Access 

The  previous  discussion  has  focused  on  the  effects  of  load-balancing  and 
communication  on  scaling.  However,  memory  access  must  also  be  considered 
in  order  to  achieve  maximum  parallel  efficiency.  In  particular,  we  wish  to  fit 
the  local  problem  into  the  fast  cache  on  each  compute  core  in  order  to  mini¬ 
mize  the  number  of  fetches  from  random  access  memory  (RAM),  since  each 
fetch  may  consume  hundreds  of  clock  cycles.  To  determine  the  core  count  at 
which  the  local  problem  fits  into  cache,  consider  that  the  approximate  size 
of  each  state-vector  for  the  global  problem  (in  bytes)  for  DG,  which  uses  an 
element-local  storage  scheme,  is  approximately  8 nvarN§(N  +  l)3,  where  the 
coefficient  8  is  the  size  of  a  real  number  and  A|  is  the  global  number  of  ele¬ 
ments.  Since  we  need  to  store  both  the  state- vector  and  a  RHS,  the  total  size 
of  the  global  problem  for  N  =  8,  iVf  =  32s  and  nvar  =  5  is  approximately  2 
GB.  The  size  of  the  global  problem  for  CG  is  slightly  smaller  due  to  a  global 
storage  scheme,  although  the  difference  is  small  for  high  polynomial  orders. 
Each  quad-core  processor  on  Ranger  has  6  MB  of  shared  L3  cache  and  each 
core  has  512  KB  of  dedicated  L2  cache.  Hence  the  total  cache  per  core  on 
Ranger  is  2  MB.  Assuming  good  load  balancing,  the  local  problem  will  fit  into 
cache  using  1000  cores.  Performing  the  same  calculation  for  N  =  4,  we  see 
the  local  problem  will  fit  into  cache  using  approximately  160  cores.  There¬ 
fore,  we  expect  to  see  a  super-linear  speedup  at  these  core  counts,  which  is 
evident  in  Figs.  9  and  10  (see  panels  b)  where  the  CG  and  DG  simulations 
exceed  the  ideal  scaling  curve). 
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Figure  9:  Scaling  study  performed  using  the  NUMA-CG  and  NUMA-DG  codes  for  the 
RTB  problem  using  32s  elements  and  N  =  4  polynomials.  The  wallclock  time  refers  to 
the  total  simulation  time. 


(a)  wallclock  time  (b)  speedup 

Figure  10:  Scaling  study  performed  using  the  NUMA-CG  and  NUMA-DG  codes  for  the 
RTB  problem  using  323  elements  and  N  =  8. 
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6.  Discussion  and  Conclusion 


6.1.  Future  Work 

6.1.1.  Microphysical  Parameterizations 

In  conjunction  with  the  Naval  Research  Laboratory-Monterey,  we  are  in¬ 
corporating  microphysical  parameterizations  into  NUMA.  Preliminary  exper¬ 
iments  using  the  Kessler  scheme  [22]  have  been  conducted  within  a  2D,  serial 
implementation  [12].  Since  physical  parameterizations  operate  on  columns  of 
data  independently  of  adjacent  data,  the  problem  is  embarrassingly  parallel 
provided  the  domain  is  decomposed  in  the  horizontal  only  such  that  all  z 
values  reside  on-processor.  To  facilitate  scaling  on  hybrid  shared-distributed 
memory  architectures  (such  as  TACC’s  Ranger),  hierarchical  domain  de¬ 
composition  is  desirable,  whereby  MPI  communication  based  on  the  algo¬ 
rithm  developed  in  the  present  paper  is  utilized  in  the  horizontal  and  either 
OpenMP  parallelization,  appropriate  for  shared  memory,  or  graphical  pro¬ 
cessor  units  (GPUs)  are  employed  for  fine-grained  parallelism  in  the  vertical. 
For  such  problems  where  the  domain  decomposition  is  only  performed  in  the 
xy  plane  such  that  all  z  values  are  on-processor  both  the  CG  and  DG  methods 
will  always  have  a  much  larger  computational  volume  versus  communication 
footprint  that  will  allow  both  methods  to  scale  perfectly  for  the  maximum 
number  of  processors  accommodated  by  a  2D  domain  decomposition. 

6.2.  Conclusion 

In  this  paper,  we  have  developed  a  nonhydrostatic  model  of  the  atmo¬ 
sphere  (NUMA)  based  on  both  CG  and  DG  discretizations  in  space  and  ex¬ 
plicit  discretization  in  time.  We  note  that  purely  explicit  time-discretization 
is  not  feasible  due  to  the  stringent  CFL  restriction;  therefore,  we  will  present 
the  results  of  our  implicit-explicit  time-integrators  in  a  future  work.  This 
paper  has  subjected  NUMA  to  a  couple  of  limited-area  simulations,  includ¬ 
ing  orographic  flow  and  buoyant  convection  problems.  The  results  of  these 
test  problems  are  in  agreement  with  either  previous  simulations,  analytical 
results,  or  physical  intuition. 

The  parallel  performance  study  described  in  Sec.  5  suggests  several  direc¬ 
tions  for  the  future  of  EBG  methods  (in  particular,  DG  methods).  Firstly, 
good  scalability  was  exhibited  for  both  the  CG  and  DG  codes  for  0(  104) 
cores,  with  near-linear  strong  scaling  for  the  DG  code  beyond  0(  104).  As 
described  in  Secs.  4.4  and  4.5,  we  attribute  these  scalability  results  to  a  large 
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computational  volume  relative  to  communication  volume  and  high  data  local¬ 
ity,  which  yields  a  low-cache  miss  percentage.  In  particular,  DG  exhibits  very 
low  communication  volume  and  outstanding  data  locality,  indicating  that  DG 
will  be  able  to  achieve  ideal  scaling  up  to  0(  105)  cores  with  no  modifications. 
Secondly,  these  experiments  indicate  that  high  order  EBGs  (e.g.  N  =  8  rel¬ 
ative  to  N  =  4)  become  more  efficient  at  higher  core  counts.  Hence,  as  we 
move  towards  the  exascale  epoch  in  which  core  counts  of  O  (105)  and  O  (106) 
are  rapidly  become  available,  these  high  order  EBGs  may  become  competi¬ 
tive  with  their  established  low-order  counterparts  (finite  element  and  finite 
volume  methods).  For  these  reasons,  high-order  EBG  methods  are  excellent 
candidates  for  next- generation  NWP  models. 
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